K-means clustering on restaurant menus¶

This project categorizes 160K food orders from 100 different pizza shops. Product categories and product names are vectorized with a ‘Bag of Words’ model and grouped with K-means clustering, applied in several passes.

In [1]:
# Import libraries
import pandas as pd

# Import custom functions
from functions.data_preprocess import stopwords_n_stemming  
from functions.plotting import plotly_pie_chart
from functions.clustering import Clustering_functions 

# Instantiate class
clf = Clustering_functions()
[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\etsia\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!

Step 1: Data preprocessing¶

In [2]:
# Read original dataset
original_dataset = pd.read_csv('orderItems.csv')
# Get product name (column 7) and product category (column 11)
testset = original_dataset.iloc[:, [7, 11]].values

corpusname = stopwords_n_stemming(testset[:,0])
corpuscat = stopwords_n_stemming(testset[:,1])

# Create a DataFrame with the original index, product category, name and price.
# .copy() avoids inserting columns into a view of original_dataset.
dataset = original_dataset.iloc[:, [9]].copy()
indexx = list(range(len(testset)))
dataset.insert(0, "original index", indexx, True)
dataset.insert(1, "product_category_name", corpuscat, True)
dataset.insert(2, "product_name", corpusname, True)

dataset.head(10)
Out[2]:
original index product_category_name product_name product_type_price
0 0 gourmet pizza chicken cordon bleu pizza 16.95
1 1 starter tasti garlic bread marinara sauc mozzarella chees 4.75
2 2 beverag soda 4.45
3 3 pizza tradit plain chees pizza 8.00
4 4 side order onion ring 4.99
5 5 specialti pizza roma pizza 13.99
6 6 side order french fri gravi 3.50
7 7 greek specialti lamb beef gyro platter 11.99
8 8 pasta dish chicken il palio pasta 8.99
9 9 specialti pizza tandori chicken pizza 15.99
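
`stopwords_n_stemming` is a custom helper whose source is not shown in this notebook. A minimal sketch of what such a step might look like follows, assuming lower-casing, removal of non-letter characters, stop-word filtering and Porter stemming. The name `clean_text` and the short stop-word set are illustrative only and are not the project's code.

```python
import re
from nltk.stem import PorterStemmer

# Illustrative stop-word set (an assumption); the notebook itself relies on
# NLTK's downloaded English stop-word list.
STOPWORDS = {"a", "an", "and", "the", "with", "of", "in", "on", "or"}
_stemmer = PorterStemmer()

def clean_text(text):
    """Lower-case, keep letters only, drop stop words, and stem each remaining word."""
    words = re.sub(r"[^a-zA-Z]", " ", str(text)).lower().split()
    return " ".join(_stemmer.stem(w) for w in words if w not in STOPWORDS)

cleaned = [clean_text(t) for t in ["Side Orders", "Onion Rings", "Soda"]]
```

Stemming is what produces truncated tokens such as `chees`, `tasti` and `beverag` in the table above, so that singular and plural spellings of a product collapse to the same feature.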

Step 2: Define plotting parameters¶

In [3]:
# define the plotting package
# Jupyter notebook is able to create interactive plotly figures
plotting_package = 'plotly' 

# define if figures will be exported locally
export_graph = True

Step 3: Split pizzas from the rest of the products (using category)¶

In [4]:
pizza_df = dataset[dataset['product_category_name'].str.contains(r'pizza')].copy()
nonpizza_df = dataset[~dataset['product_category_name'].str.contains(r'pizza')].copy()

# create a pizza - non pizza pie chart
labels = 'pizzas', 'non-pizza products'
plot_title = 'pizza - non-pizza product distribution'
sizes = [(len(pizza_df)/len(dataset))*100, (len(nonpizza_df)/len(dataset))*100]
fig = plotly_pie_chart(labels, sizes, plot_title, export_graph)
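
The split above evaluates the same `str.contains` condition twice. A possible variant, sketched on a made-up toy frame rather than the real `dataset`, builds the boolean mask once and derives both subsets and both percentages from it.

```python
import pandas as pd

# Toy frame standing in for `dataset`; the rows are made up for illustration.
toy = pd.DataFrame({
    "product_category_name": ["gourmet pizza", "beverag", "side order", "specialti pizza"],
})

# Build the boolean mask once and reuse it for both subsets.
is_pizza = toy["product_category_name"].str.contains("pizza", regex=False)
toy_pizza = toy[is_pizza].copy()
toy_nonpizza = toy[~is_pizza].copy()

# The mean of a boolean mask is the share of True values.
pizza_share = is_pizza.mean() * 100
sizes = [pizza_share, 100 - pizza_share]
```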

Step 4: Define the clustering variables¶

In [5]:
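nclusters_pizza = 30 # number of clusters for pizza products (used for both category and name)
nclusters_cat_nopizza, nclusters_name_nopizza = 30, 15 # non-pizza products: category clusters, name clusters per category
max_features = 50 # maximum number of features (vocabulary size) for the Bag of Words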

Step 5: Pizza clustering¶

In [6]:
# Conduct K-means clustering based on product category
print('K-means clustering: pizza products, by product-category')
cat_y_kmeans, cat_clusternames, pizza_categories_df = clf.complete_clustering(
    pizza_df, 1, nclusters_pizza, max_features,
    'Initial pizza categories', 'predicted_category', plotting_package, export_graph)
# Conduct K-means clustering based on product name
print('K-means clustering: pizza products, by product-name')
name_y_kmeans, name_clusternames, pizza_names_df = clf.complete_clustering(
    pizza_df, 2, nclusters_pizza, max_features,
    'Pizza products', 'predicted_name', plotting_package, export_graph)

# Update the pizza dataframe
pizza_df.insert(4, 'predicted_category', pizza_categories_df['predicted_category'])
pizza_df.insert(5, 'predicted_name', pizza_names_df['predicted_name'])

pizza_df.head(20)
K-means clustering: pizza products, by product-category
Complete loss of information will occur for 0.0% of products
initial clustering inertia:  2042.09
K-means clustering: pizza products, by product-name
Complete loss of information will occur for 0.08% of products
initial clustering inertia:  14307.34
Out[6]:
original index product_category_name product_name product_type_price predicted_category predicted_name
0 0 gourmet pizza chicken cordon bleu pizza 16.95 gourmet pizza chicken pizza
3 3 pizza tradit plain chees pizza 8.00 pizza chees pizza plain
5 5 specialti pizza roma pizza 13.99 pizza specialti pizza
9 9 specialti pizza tandori chicken pizza 15.99 pizza specialti chicken pizza
11 11 pizza chees pizza 12.99 pizza chees pizza
13 13 specialti pizza hawaiian pizza 17.49 pizza specialti pizza
16 16 deep dish pizza chicago deep dish pizza 17.00 deep dish pizza chees deep dish pizza
18 18 classic new york pizza famou chees pizza 6.99 classic new pizza york chees famou pizza
19 19 pizza chees pizza 17.95 pizza chees pizza
25 25 specialti pizza amalfi pizza 18.99 pizza specialti pizza
28 28 pizza chees pizza 13.99 pizza chees pizza
32 32 new york style pizza creat new york pizza 17.00 new pizza style york creat new pizza york
34 34 gourmet pizza amigo combo climax pizza 24.99 gourmet pizza pizza
40 40 new york style pizza creat new york pizza 17.00 new pizza style york creat new pizza york
44 44 specialti pizza meat combo pizza 16.99 pizza specialti pizza
45 45 pizza chees pizza 9.75 pizza chees pizza
47 47 new york style gourmet pizza veggi suprem pizza 19.00 gourmet new pizza style york pizza veggi
48 48 new york style pizza creat new york pizza 17.00 new pizza style york creat new pizza york
53 53 tradit ny pizza chees pizza 10.99 ny pizza tradit chees pizza
56 56 specialti pizza tandori chicken pizza 17.99 pizza specialti chicken pizza

Step 6: Non-pizza clustering¶

In [7]:
# Conduct K-means clustering on product category
print('K-means clustering: non-pizza products, by product-category')
cat_y_kmeans, cat_clusternames, nonpizza_categories_df = clf.complete_clustering(
    nonpizza_df, 1, nclusters_cat_nopizza, max_features,
    'Initial nonpizza categories', 'predicted_category', plotting_package, export_graph)
    
# Update the nonpizza dataframe
nonpizza_df.insert(4, 'predicted_category', nonpizza_categories_df['predicted_category'])

# Conduct K-means clustering based on product name
nonpizza_df.insert(5, 'predicted_name', '') # create an empty column to be updated
for jj in range(nclusters_cat_nopizza):
    print('K-means clustering: ' + cat_clusternames[jj] + ' products, by product-name')
    
    # Get the data sub-set
    target_product_cat = nonpizza_df[nonpizza_df['predicted_category'] == cat_clusternames[jj]].copy()
       
    # Conduct K-means clustering
    name_y_kmeans, name_clusternames, target_product_names = clf.complete_clustering(
        target_product_cat, 2, nclusters_name_nopizza, max_features,
        cat_clusternames[jj], 'predicted_name', plotting_package, export_graph)
    
    # Update the nonpizza dataframe
    nonpizza_df.update(target_product_names) 
    
    del target_product_cat, name_y_kmeans, name_clusternames, target_product_names
K-means clustering: non-pizza products, by product-category
Complete loss of information will occur for 3.1% of products
initial clustering inertia:  21471.89
K-means clustering: appet products, by product-name
Complete loss of information will occur for 4.34% of products
initial clustering inertia:  9506.99
K-means clustering: display non product products, by product-name
Complete loss of information will occur for 1.18% of products
initial clustering inertia:  9808.23
K-means clustering: various products, by product-name
Complete loss of information will occur for 10.01% of products
initial clustering inertia:  10026.3
K-means clustering: hot sub products, by product-name
Complete loss of information will occur for 0.0% of products
initial clustering inertia:  2350.28
K-means clustering: beverag products, by product-name
Complete loss of information will occur for 0.35% of products
initial clustering inertia:  531.15
K-means clustering: sandwich products, by product-name
Complete loss of information will occur for 0.1% of products
initial clustering inertia:  2450.85
K-means clustering: salad products, by product-name
Complete loss of information will occur for 0.04% of products
initial clustering inertia:  3211.72
K-means clustering: order side products, by product-name
Complete loss of information will occur for 2.34% of products
initial clustering inertia:  3366.2
K-means clustering: wing products, by product-name
Complete loss of information will occur for 0.0% of products
initial clustering inertia:  814.03
K-means clustering: dish pasta products, by product-name
Complete loss of information will occur for 1.84% of products
initial clustering inertia:  2672.54
K-means clustering: special products, by product-name
Complete loss of information will occur for 4.53% of products
initial clustering inertia:  3198.57
K-means clustering: dessert products, by product-name
Complete loss of information will occur for 2.16% of products
initial clustering inertia:  1803.42
K-means clustering: calzon stromboli products, by product-name
Complete loss of information will occur for 0.25% of products
initial clustering inertia:  807.54
K-means clustering: wing products, by product-name
Complete loss of information will occur for 0.0% of products
initial clustering inertia:  814.03
K-means clustering: wrap products, by product-name
Complete loss of information will occur for 0.0% of products
initial clustering inertia:  1077.42
K-means clustering: pasta products, by product-name
Complete loss of information will occur for 0.57% of products
initial clustering inertia:  1624.8
K-means clustering: calzon products, by product-name
Complete loss of information will occur for 0.17% of products
initial clustering inertia:  980.4
K-means clustering: steak products, by product-name
Complete loss of information will occur for 0.0% of products
initial clustering inertia:  280.25
K-means clustering: cold sub products, by product-name
Complete loss of information will occur for 0.0% of products
initial clustering inertia:  496.5
K-means clustering: side products, by product-name
Complete loss of information will occur for 1.17% of products
initial clustering inertia:  735.85
K-means clustering: cheesesteak products, by product-name
Complete loss of information will occur for 0.0% of products
initial clustering inertia:  256.78
K-means clustering: entre products, by product-name
Complete loss of information will occur for 1.7% of products
initial clustering inertia:  1100.75
K-means clustering: specialti products, by product-name
Complete loss of information will occur for 1.33% of products
initial clustering inertia:  1068.46
K-means clustering: kid menu products, by product-name
Complete loss of information will occur for 1.47% of products
initial clustering inertia:  919.65
K-means clustering: hot sandwich products, by product-name
Complete loss of information will occur for 0.0% of products
initial clustering inertia:  1210.95
K-means clustering: burger products, by product-name
Complete loss of information will occur for 0.0% of products
initial clustering inertia:  313.99
K-means clustering: grinder products, by product-name
Complete loss of information will occur for 0.0% of products
initial clustering inertia:  715.62
K-means clustering: sub products, by product-name
Complete loss of information will occur for 0.0% of products
initial clustering inertia:  757.56
K-means clustering: item popular products, by product-name
Complete loss of information will occur for 1.43% of products
initial clustering inertia:  189.62
K-means clustering: chicken products, by product-name
Complete loss of information will occur for 0.13% of products
initial clustering inertia:  471.46
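
The log above lists `wing` twice with identical figures, which suggests that two category clusters received the same name and the same subset was clustered a second time. One way to visit each distinct category exactly once is to group on the predicted label instead of indexing `cat_clusternames` by position. The sketch below uses a made-up toy frame, and `label_subset` is a hypothetical stand-in for the per-category clustering call.

```python
import pandas as pd

# Toy frame standing in for `nonpizza_df`; the rows are made up for illustration.
toy = pd.DataFrame({
    "product_name": ["buffalo wing", "bbq wing", "soda", "garden salad"],
    "predicted_category": ["wing", "wing", "beverag", "salad"],
    "predicted_name": ["", "", "", ""],
})

def label_subset(subset):
    """Hypothetical stand-in for the per-category clustering step:
    it simply uses the last word of each product name as the label."""
    return pd.DataFrame(
        {"predicted_name": subset["product_name"].str.split().str[-1]},
        index=subset.index,
    )

visited = []
# groupby yields each distinct category once, together with its rows.
for category, subset in toy.groupby("predicted_category"):
    visited.append(category)
    toy.update(label_subset(subset))  # aligned on the row index
```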

Step 7: Merge pizza and non-pizza results into the original dataset¶

In [8]:
final_df = original_dataset.copy()
final_df.insert(12, 'predicted_category', '')
final_df.insert(13, 'predicted_name', '')
final_df.update(nonpizza_df)
final_df.update(pizza_df)
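
`DataFrame.update` aligns the two frames on their row index and overwrites every column they share wherever the incoming value is not NA. If the raw file uses the same column names as `pizza_df` and `nonpizza_df` (which cannot be verified from this notebook), the stemmed text would replace the raw text as well. A small sketch on made-up data shows how passing only the prediction column limits the update.

```python
import pandas as pd

# Made-up stand-in for the raw order data plus an empty prediction column.
original = pd.DataFrame({
    "product_name": ["Cheese Pizza", "Soda", "Onion Rings"],
    "predicted_name": ["", "", ""],
})

# A processed subset that keeps the original row labels (0 and 2)
# and shares both column names with `original`.
subset = pd.DataFrame(
    {"product_name": ["chees pizza", "onion ring"],
     "predicted_name": ["chees pizza", "onion ring"]},
    index=[0, 2],
)

# Passing only the prediction column leaves the raw product names untouched.
merged = original.copy()
merged.update(subset[["predicted_name"]])
```

Row 1 has no counterpart in `subset`, so its `predicted_name` stays empty.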